Search for: All records

Creators/Authors contains: "Bender, M."


  1. Background: Quantification of gene expression from RNA-seq data is a prerequisite for transcriptome analyses such as differential gene expression analysis and gene co-expression network construction. Individual RNA-seq experiments are growing larger, and combining multiple experiments from sequence repositories can result in datasets with thousands of samples. Processing hundreds to thousands of RNA-seq samples raises challenges related to data management, access to sufficient computational resources, navigation of high-performance computing (HPC) systems, installation of required software dependencies, and reproducibility. Processing of larger and deeper RNA-seq experiments will become more common as sequencing technology matures. Results: GEMmaker is an nf-core-compliant Nextflow workflow that quantifies gene expression from small to massive RNA-seq datasets. GEMmaker ensures results are highly reproducible through the use of versioned, containerized software that can be executed on a single workstation, an institutional compute cluster, a Kubernetes platform, or the cloud. GEMmaker supports popular alignment and quantification tools, providing results in raw and normalized formats. GEMmaker is unique in that it can scale to process thousands of locally or remotely stored samples without exceeding available data storage. Conclusions: Workflows that quantify gene expression are not new, and many already address issues of portability, reusability, and scale in terms of access to CPUs. GEMmaker provides these benefits and adds the ability to scale despite limited data-storage infrastructure. This allows users to process hundreds to thousands of RNA-seq samples even when data storage resources are limited. GEMmaker is freely available and fully documented with step-by-step setup and execution instructions.
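To make the storage-bounded scaling idea in item 1 concrete, here is a minimal Python sketch of a fetch-quantify-delete loop that keeps disk usage bounded no matter how many samples are processed. This is only an illustration under stated assumptions: GEMmaker itself is a Nextflow workflow, and every function, path, and parameter name below is a hypothetical stand-in.

```python
from pathlib import Path

def fetch_sample(sample_id: str, workdir: Path) -> Path:
    """Hypothetical stand-in for downloading a remote sample (e.g., FASTQ from SRA)."""
    fastq = workdir / f"{sample_id}.fastq"
    fastq.touch()  # placeholder for the real download
    return fastq

def quantify(fastq: Path) -> dict:
    """Hypothetical stand-in for an alignment/quantification tool (e.g., kallisto or salmon)."""
    return {"sample": fastq.stem, "counts": {}}  # placeholder result

def process(sample_ids, workdir: Path = Path("work")) -> list:
    """Quantify samples one at a time, deleting raw data as soon as it is used.

    Only the small expression results accumulate, so peak disk usage stays
    roughly constant regardless of how many samples are processed.
    """
    workdir.mkdir(exist_ok=True)
    results = []
    for sample_id in sample_ids:
        fastq = fetch_sample(sample_id, workdir)
        results.append(quantify(fastq))
        fastq.unlink()  # discard the large intermediate immediately
    return results

print(len(process([f"SRR{i:07d}" for i in range(25)])))
```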
  2. In each step of the cup game on n cups, a filler distributes up to 1 − ε units of water among the cups, and then an emptier removes 1 unit of water from a single cup. The emptier's goal is to minimize the height of the fullest cup, also known as the backlog. The cup emptying game has found extensive applications to processor scheduling, network-switch buffer management, quality-of-service guarantees, and data-structure deamortization. The greedy emptying algorithm (i.e., always remove from the fullest cup) is known to achieve backlog O(log n) and to be the optimal deterministic algorithm. Randomized algorithms can do significantly better, achieving backlog O(log log n) with high probability, as long as ε is not too small. In order to achieve these improvements, the known randomized algorithms require that the filler is an oblivious adversary, unaware of which cups the emptier chooses to empty at each step. Such randomized guarantees are known to be impossible against fully adaptive fillers. We show that, even when the filler is only slightly non-adaptive, randomized emptying algorithms can still guarantee a backlog of O(log log n). In particular, we give randomized algorithms against an elevated adaptive filler, which is an adaptive filler that can see the precise fills of every cup containing more than 3 units of water, but not of the cups containing less than 3 units.
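As a concrete illustration of the setup in item 2, the sketch below simulates the cup game with the greedy emptier against a toy random filler. The filler here is an assumption made for illustration only; it is not one of the adversaries analyzed in the paper, and the simulation demonstrates the rules of the game rather than the O(log n) or O(log log n) bounds.

```python
import random

def greedy_emptier_backlog(n, steps, eps=0.1, seed=0):
    """Simulate the single-processor cup game with the greedy emptier.

    Each step a toy filler spreads up to 1 - eps units of water over the
    cups, then the emptier removes 1 unit from the fullest cup. Returns the
    largest backlog (height of the fullest cup) seen during the run.
    """
    rng = random.Random(seed)
    fills = [0.0] * n
    worst_backlog = 0.0
    for _ in range(steps):
        # Toy filler (an assumption): split 1 - eps units across a few random cups.
        targets = rng.sample(range(n), k=min(3, n))
        for t in targets:
            fills[t] += (1 - eps) / len(targets)
        # Greedy emptier: remove 1 unit from the fullest cup (never below 0).
        fullest = max(range(n), key=fills.__getitem__)
        fills[fullest] = max(0.0, fills[fullest] - 1.0)
        worst_backlog = max(worst_backlog, max(fills))
    return worst_backlog

print(greedy_emptier_backlog(n=64, steps=10_000))
```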
  3. First introduced in 1954, the linear-probing hash table is among the oldest data structures in computer science, and thanks to its unrivaled data locality, linear probing continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because of an effect known as primary clustering, which causes insertions at a load factor of 1 - 1/x to take expected time Θ(x²) (rather than the intuitive running time of Θ(x)). The dangers of primary clustering, first discovered by Knuth in 1963, have now been taught to generations of computer scientists, and have influenced the design of some of the most widely used hash tables in production. We show that primary clustering is not the foregone conclusion that it is reputed to be. We demonstrate that seemingly small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions: if these design decisions are made correctly, then even if a hash table operates continuously at a load factor of 1 - 1/x, the expected amortized cost per insertion/deletion is O(x). This is because the tombstones left behind by deletions can actually cause an anti-clustering effect that combats primary clustering. Interestingly, these design decisions, despite their remarkable effects, have historically been viewed as simply implementation-level engineering choices. We also present a new variant of linear probing (which we call graveyard hashing) that completely eliminates primary clustering on any sequence of operations: if, when an operation is performed, the current load factor is 1 - 1/x for some x, then the expected cost of the operation is O(x). Thus we can achieve the data locality of traditional linear probing without any of the disadvantages of primary clustering. One corollary is that, in the external-memory model with data blocks of size B, graveyard hashing offers the following remarkably strong guarantee: at any load factor 1 - 1/x satisfying x = o(B), graveyard hashing achieves 1 + o(1) expected block transfers per operation. In contrast, past external-memory hash tables have only been able to offer a 1 + o(1) guarantee when the block size B is at least Ω(x²). Our results come with actionable lessons for both theoreticians and practitioners; in particular, well-designed use of tombstones can completely change the asymptotic landscape of how linear probing behaves (even in workloads without deletions).
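To ground the tombstone discussion in item 3, here is a textbook-style linear-probing set in which deletions leave tombstones that later insertions reclaim. It is a minimal sketch of the underlying mechanism, not the paper's graveyard hashing (which additionally inserts tombstones proactively); resizing and rebuilds are omitted, so it assumes the table never completely fills.

```python
EMPTY, TOMBSTONE = object(), object()

class LinearProbingSet:
    """Minimal linear-probing hash set with tombstone-based deletion."""

    def __init__(self, capacity=16):
        # No resizing for brevity: assumes the table never completely fills.
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Yield slot indices starting at the key's hash, wrapping around.
        i = hash(key) % len(self.slots)
        while True:
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key):
        first_tombstone = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                # Place the key, reusing the first tombstone we passed, if any.
                target = first_tombstone if first_tombstone is not None else i
                self.slots[target] = key
                return
            if slot is TOMBSTONE:
                if first_tombstone is None:
                    first_tombstone = i
            elif slot == key:
                return  # already present

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return  # not present
            if slot is not TOMBSTONE and slot == key:
                self.slots[i] = TOMBSTONE  # leave a tombstone behind
                return

    def contains(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot == key:
                return True

# Tiny usage example.
s = LinearProbingSet()
for k in ["a", "b", "c"]:
    s.insert(k)
s.delete("b")
assert s.contains("a") and not s.contains("b")
```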
  4. Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., approximately 100K observations per second.
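For item 4, the reporting rule itself is easy to state in code: keep exact counts and report an item the moment it reaches its T = φN-th occurrence. The sketch below is the accurate but space-hungry in-RAM baseline (Ω(N) words) that the abstract contrasts with; the paper's external-memory data structures are not reproduced here.

```python
from collections import defaultdict

def timely_event_detection(stream, N, phi):
    """Exact, in-RAM timely heavy-hitter reporting.

    Reports each item as soon as it reaches its T = phi * N-th occurrence.
    Accurate and timely, but may use up to Omega(N) words of memory -- the
    baseline that external-memory TED data structures improve on.
    """
    threshold = phi * N
    counts = defaultdict(int)
    reported = set()
    for position, item in enumerate(stream):
        counts[item] += 1
        if counts[item] >= threshold and item not in reported:
            reported.add(item)
            yield position, item  # report immediately (no false negatives)

# Example: report items that occur in at least 20% of a stream of length 30.
stream = list("abcabcaaaaabbbbbcccccdddddeeee")
for pos, item in timely_event_detection(stream, N=len(stream), phi=0.2):
    print(f"item {item!r} became a heavy hitter at position {pos}")
```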
  5. (No abstract available for this record.)
  6. Gene Expression Matrices (GEMs) are a fundamental data type in the genomics domain. As the size and scope of genomics experiments increase, researchers are struggling to process large GEMs through downstream workflows with currently accepted practices. In this paper, we propose a methodology to reduce the size of GEMs using multiple approaches. Our method partitions data into discrete fields based on data type and employs state-of-the-art lossless and lossy compression algorithms to reduce the input data size. This work explores a variety of lossless and lossy compression methods to determine which methods work best for each component of a GEM. We evaluate the accuracy of the compressed GEMs by running them through the Knowledge Independent Network Construction (KINC) workflow and comparing the quality of the resulting gene co-expression network with a lossless control to verify result fidelity. Results show that utilizing a combination of lossy and lossless compression yields compression ratios up to 9.77× on a yeast GEM, while still preserving the biological integrity of the data. Applying the compression methodology to the Cancer Cell Line Encyclopedia (CCLE) GEM resulted in compression ratios up to 9.26×. By using this methodology, researchers in the genomics domain may be able to process previously inaccessible GEMs while realizing significant reductions in computational costs.
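The field-wise idea in item 6 can be sketched as follows: split a GEM into a text field (gene names) and a numeric field (expression values), quantize the numbers as a simple lossy step, and compress both losslessly. The rounding precision and the gzip codec below are illustrative assumptions, not the specific methods benchmarked in the paper.

```python
import gzip
import numpy as np

def compress_gem(gene_names, expression, decimals=2):
    """Illustrative field-wise GEM compression.

    Gene names are compressed losslessly; expression values are rounded to a
    fixed precision (a simple lossy step) before lossless compression.
    """
    names_blob = gzip.compress("\n".join(gene_names).encode())
    quantized = np.round(expression, decimals).astype(np.float32)
    values_blob = gzip.compress(quantized.tobytes())
    return names_blob, values_blob, quantized.shape

def decompress_gem(names_blob, values_blob, shape):
    gene_names = gzip.decompress(names_blob).decode().split("\n")
    values = np.frombuffer(gzip.decompress(values_blob), dtype=np.float32).reshape(shape)
    return gene_names, values

# Example: a tiny synthetic "GEM" with 100 genes and 10 samples.
rng = np.random.default_rng(0)
genes = [f"gene_{i}" for i in range(100)]
gem = rng.lognormal(mean=2.0, sigma=1.0, size=(100, 10))
names_blob, values_blob, shape = compress_gem(genes, gem)
raw_bytes = gem.astype(np.float32).nbytes + len("\n".join(genes))
compressed_bytes = len(names_blob) + len(values_blob)
print(f"compression ratio: {raw_bytes / compressed_bytes:.2f}x")
```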
  7. Two additions impacting tables 3 and 4 in ref. [1] are presented in the following. No significant impact is found for other results or figures in ref. [1].